In this module, we will learn how we can quickly analyse gene
expression data to summarise the pathway enrichment analysis, using
already published tools. This is a faster way for those databases that
are well documented in Org.Db package. Please note that
customized analysis is required for the transcriptomes that are not
present in this database.
The source of the dataset is from the following paper, we investigate the impact of root dwelling microbes on plants cultivated in lower photosynthetic active radiation: A microbiota–root–shoot circuit favours Arabidopsis growth over defence under suboptimal light
pkgs <- c("BiocManager", "tidyverse", "ggridges")
# install.packages(pkgs)
lapply(pkgs, require, character.only = TRUE)
## Loading required package: BiocManager
## Bioconductor version '3.16' is out-of-date; the current release version '3.17'
## is available with R version '4.3'; see https://bioconductor.org/install
## Loading required package: tidyverse
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Loading required package: ggridges
# Bioconductor packages
org <- "org.At.tair.db" # For arabidopsis
pkgs_bc <- c("clusterProfiler", "enrichplot", "pathview", org, "DOSE")
# lapply(pkgs_bc, install, character.only = TRUE)
lapply(pkgs_bc, require, character.only = TRUE)
## Loading required package: clusterProfiler
##
## clusterProfiler v4.6.2 For help: https://yulab-smu.top/biomedical-knowledge-mining-book/
##
## If you use clusterProfiler in published research, please cite:
## T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu. clusterProfiler 4.0: A universal enrichment tool for interpreting omics data. The Innovation. 2021, 2(3):100141
##
## Attaching package: 'clusterProfiler'
##
## The following object is masked from 'package:purrr':
##
## simplify
##
## The following object is masked from 'package:stats':
##
## filter
##
## Loading required package: enrichplot
## Warning: package 'enrichplot' was built under R version 4.2.3
## Loading required package: pathview
##
## ##############################################################################
## Pathview is an open source software package distributed under GNU General
## Public License version 3 (GPLv3). Details of GPLv3 is available at
## http://www.gnu.org/licenses/gpl-3.0.html. Particullary, users are required to
## formally cite the original Pathview paper (not just mention it) in publications
## or products. For details, do citation("pathview") within R.
##
## The pathview downloads and uses KEGG data. Non-academic uses may require a KEGG
## license agreement (details at http://www.kegg.jp/kegg/legal.html).
## ##############################################################################
## Loading required package: org.At.tair.db
## Loading required package: AnnotationDbi
## Loading required package: stats4
## Loading required package: BiocGenerics
##
## Attaching package: 'BiocGenerics'
##
## The following objects are masked from 'package:lubridate':
##
## intersect, setdiff, union
##
## The following objects are masked from 'package:dplyr':
##
## combine, intersect, setdiff, union
##
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
##
## The following objects are masked from 'package:base':
##
## anyDuplicated, aperm, append, as.data.frame, basename, cbind,
## colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
## get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
## match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
## Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
## table, tapply, union, unique, unsplit, which.max, which.min
##
## Loading required package: Biobase
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
##
## Loading required package: IRanges
## Loading required package: S4Vectors
##
## Attaching package: 'S4Vectors'
##
## The following object is masked from 'package:clusterProfiler':
##
## rename
##
## The following objects are masked from 'package:lubridate':
##
## second, second<-
##
## The following objects are masked from 'package:dplyr':
##
## first, rename
##
## The following object is masked from 'package:tidyr':
##
## expand
##
## The following objects are masked from 'package:base':
##
## expand.grid, I, unname
##
##
## Attaching package: 'IRanges'
##
## The following object is masked from 'package:clusterProfiler':
##
## slice
##
## The following object is masked from 'package:lubridate':
##
## %within%
##
## The following objects are masked from 'package:dplyr':
##
## collapse, desc, slice
##
## The following object is masked from 'package:purrr':
##
## reduce
##
##
## Attaching package: 'AnnotationDbi'
##
## The following object is masked from 'package:clusterProfiler':
##
## select
##
## The following object is masked from 'package:dplyr':
##
## select
##
##
## Loading required package: DOSE
## DOSE v3.24.2 For help: https://yulab-smu.top/biomedical-knowledge-mining-book/
##
## If you use DOSE in published research, please cite:
## Guangchuang Yu, Li-Gen Wang, Guang-Rong Yan, Qing-Yu He. DOSE: an R/Bioconductor package for Disease Ontology Semantic and Enrichment analysis. Bioinformatics 2015, 31(4):608-609
df <- read.delim2("./data/Hou.et.al.txt", header = TRUE, row.names = 1, sep = "\t")
case_in <- "SC_Root"
id_deg <- str_detect(colnames(df), case_in)
lfc_mat <- df[,id_deg]
colnames(lfc_mat) <- str_replace_all(colnames(lfc_mat), paste0("\\.", case_in), "")
# degs <- lfc_mat %>% dplyr::filter(abs(as.numeric(`log2FC`)) >= 1 & as.numeric(`FDR`) <= 0.05) # Filter the ones that are significant
gene_list <- as.numeric(lfc_mat$log2FC)
names(gene_list) <- row.names(lfc_mat)
gene_list <- na.omit(gene_list) %>% sort(., decreasing = TRUE)
The package clusterProfiler provides onle line command
to perform GSEA analysis.
# One line for GSE
gse <- clusterProfiler::gseGO(geneList = gene_list,
ont ="ALL",
keyType = "TAIR",
nPerm = 10000,
minGSSize = 3,
maxGSSize = 800,
pvalueCutoff = 0.05,
verbose = TRUE,
OrgDb = org,
pAdjustMethod = "none")
## preparing geneSet collections...
## GSEA analysis...
## Warning in .GSEA(geneList = geneList, exponent = exponent, minGSSize =
## minGSSize, : We do not recommend using nPerm parameter incurrent and future
## releases
## Warning in fgsea(pathways = geneSets, stats = geneList, nperm = nPerm, minSize
## = minGSSize, : You are trying to run fgseaSimple. It is recommended to use
## fgseaMultilevel. To run fgseaMultilevel, you need to remove the nperm argument
## in the fgsea function call.
## Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, : There are ties in the preranked stats (1.52% of the list).
## The order of those tied genes will be arbitrary, which may produce unexpected results.
## leading edge analysis...
## done...
Lets see if our gene ontology terms corresponds to some biological function.
dotplot(gse,
showCategory = 10, split = ".sign") +
facet_grid(.~.sign)
Lets explore the top 5 categories in GO term.
emapplot(pairwise_termsim(gse), showCategory = 10)
Now lets look at the regulatory network of the gene set.
cnetplot(gse,
categorySize = "pvalue",
foldChange = gene_list,
showCategory = 2)
## Warning in cnetplot.enrichResult(x, ...): Use 'color.params = list(foldChange = your_value)' instead of 'foldChange'.
## The foldChange parameter will be removed in the next version.
## Scale for size is already present.
## Adding another scale for size, which will replace the existing scale.
## Warning: ggrepel: 430 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
Lets see the distibution of the p-values for the
corresponding pathways.
require(ggridges)
ridgeplot(gse) +
labs(x = "Gene enrichment")
## Picking joint bandwidth of 0.186
This is the most interesting plot where you can see the expression pattrerns of the genes on the ranked list. You can specifically trace the running enrichment score and understand the consistency of gene expression of the corresponding pathway.
gseaplot(gse, by = "all", title = gse$Description[1], geneSetID = 1)
gs1 <- unlist(str_split(gse$core_enrichment[1], "/"))
save(list = "gs1", file = "./gene_set_for_wounding.RData")
Now, lets dive into the the pathway analysis using the KEGG database.
Here, we have to first change the id that is compatible with the KEGG
database. We will be using bitr function to fetch the
corresponding ENTREZID from the TAIR id.
# Obtain the gene list for KEGG
keg_gene_list <- as.numeric(lfc_mat$log2FC)
names(keg_gene_list) <- row.names(lfc_mat)
# Convert gene IDs for gseKEGG function
# We will lose some genes here because not all IDs will be converted
ids <- bitr(names(keg_gene_list),
fromType = "TAIR",
toType = "ENTREZID",
OrgDb = org.At.tair.db)
## 'select()' returned 1:1 mapping between keys and columns
## Warning in bitr(names(keg_gene_list), fromType = "TAIR", toType = "ENTREZID", :
## 3% of input gene IDs are fail to map...
# Remove duplicate IDS (here I use "TAIR", but it should be whatever was selected as keyType)
dedup_ids = ids[!duplicated(ids[c("TAIR")]), ]
# Create a new dataframe df_keg which has only the genes which were successfully mapped using the bitr function above
df_keg <- lfc_mat[row.names(lfc_mat) %in% dedup_ids$TAIR, ]
# Create a new column in df_keg with the corresponding ENTREZ IDs
df_keg$Y <- dedup_ids$ENTREZID
# Create a vector of the gene unuiverse
kegg_gene_list <- as.numeric(df_keg$log2FC)
# Name vector with ENTREZ ids
names(kegg_gene_list) <- df_keg$Y
# omit any NA values
kegg_gene_list <- na.omit(kegg_gene_list)
# sort the list in decreasing order (required for clusterProfiler) for ranking
kegg_gene_list <- sort(kegg_gene_list, decreasing = TRUE)
# Run the KEGG enrichment
kegg_org <- "ath" # Note the short form of the organism
kkg <- gseKEGG(geneList = kegg_gene_list,
organism = kegg_org,
nPerm = 10000,
minGSSize = 3,
maxGSSize = 800,
pvalueCutoff = 0.05,
pAdjustMethod = "none",
keyType = "ncbi-geneid") # Keytype is
## Reading KEGG annotation online: "https://rest.kegg.jp/link/ath/pathway"...
## Reading KEGG annotation online: "https://rest.kegg.jp/list/pathway/ath"...
## Reading KEGG annotation online: "https://rest.kegg.jp/conv/ncbi-geneid/ath"...
## preparing geneSet collections...
## GSEA analysis...
## Warning in .GSEA(geneList = geneList, exponent = exponent, minGSSize =
## minGSSize, : We do not recommend using nPerm parameter incurrent and future
## releases
## Warning in fgsea(pathways = geneSets, stats = geneList, nperm = nPerm, minSize
## = minGSSize, : You are trying to run fgseaSimple. It is recommended to use
## fgseaMultilevel. To run fgseaMultilevel, you need to remove the nperm argument
## in the fgsea function call.
## Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, : There are ties in the preranked stats (1.43% of the list).
## The order of those tied genes will be arbitrary, which may produce unexpected results.
## leading edge analysis...
## done...
Here we see the transcriptional reprogramming in pathways.
dotplot(kkg, showCategory = 10, title = "Enriched Pathways" , split=".sign") +
facet_grid(.~.sign)
emapplot(pairwise_termsim(kkg))
cnetplot(kkg)
## Warning: ggrepel: 213 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
ridgeplot(kkg) + labs(x = "Enrichment")
## Picking joint bandwidth of 0.195
The enrichment analysis shows the genes corresponding to the
flavonoid biosynthesis are enriched. Please note that
the 1 refers to the index of flavonoid biosynthesis. You
may explore other pathways by changing this.
gseaplot(kkg, by = "all", title = kkg$Description[1], geneSetID = 1)
Finally we summarise the KEGG pathway correspnding to flavanoid biosynthesis and investigate the gene expression patterns of the pathway components.
require(pathview)
# Produce the native KEGG plot (PNG)
at_pv <- pathview(gene.data=kegg_gene_list,
pathway.id="ath00941",
species = kegg_org,
mid = "white",
high = "darkgreen",
low = "darkmagenta", kegg.dir = "./kegg_out/")
## Info: Getting gene ID data from KEGG...
## Info: Done with data retrieval!
## Info: Working in directory /Users/akumarbasak/Documents/PA_practical_shared
## Info: Writing image file ath00941.pathview.png
knitr::include_graphics("./ath00941.pathview.png")
- You can download your favourite dataset (LogFC table). Please note that the transcriptome should be within the Org.db.
- Find the top 5 pathways that are enriched upon experimental perturbation by using both KEGG and GO term analyses.
- Check the interdependencies of the genes that are expressed within the top 5 pathways.
sessionInfo()
## R version 4.2.2 (2022-10-31)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur ... 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] DOSE_3.24.2 org.At.tair.db_3.16.0 AnnotationDbi_1.60.2
## [4] IRanges_2.32.0 S4Vectors_0.36.2 Biobase_2.58.0
## [7] BiocGenerics_0.44.0 pathview_1.38.0 enrichplot_1.18.4
## [10] clusterProfiler_4.6.2 ggridges_0.5.4 lubridate_1.9.2
## [13] forcats_1.0.0 stringr_1.5.0 dplyr_1.1.2
## [16] purrr_1.0.1 readr_2.1.4 tidyr_1.3.0
## [19] tibble_3.2.1 ggplot2_3.4.2 tidyverse_2.0.0
## [22] BiocManager_1.30.21
##
## loaded via a namespace (and not attached):
## [1] ggnewscale_0.4.9 fgsea_1.24.0 colorspace_2.1-0
## [4] ggtree_3.6.2 gson_0.1.0 qvalue_2.30.0
## [7] XVector_0.38.0 aplot_0.1.10 rstudioapi_0.15.0
## [10] farver_2.1.1 graphlayouts_1.0.0 ggrepel_0.9.3
## [13] bit64_4.0.5 scatterpie_0.2.1 fansi_1.0.4
## [16] codetools_0.2-18 splines_4.2.2 cachem_1.0.8
## [19] GOSemSim_2.24.0 knitr_1.43 polyclip_1.10-4
## [22] jsonlite_1.8.7 GO.db_3.16.0 png_0.1-8
## [25] graph_1.76.0 ggforce_0.4.1 compiler_4.2.2
## [28] httr_1.4.6 lazyeval_0.2.2 Matrix_1.5-1
## [31] fastmap_1.1.1 cli_3.6.1 tweenr_2.0.2
## [34] htmltools_0.5.5 tools_4.2.2 igraph_1.5.0
## [37] gtable_0.3.3 glue_1.6.2 GenomeInfoDbData_1.2.9
## [40] reshape2_1.4.4 fastmatch_1.1-3 Rcpp_1.0.11
## [43] jquerylib_0.1.4 vctrs_0.6.3 Biostrings_2.66.0
## [46] nlme_3.1-160 ape_5.7-1 ggraph_2.1.0
## [49] xfun_0.39 timechange_0.2.0 lifecycle_1.0.3
## [52] XML_3.99-0.14 org.Hs.eg.db_3.16.0 zlibbioc_1.44.0
## [55] MASS_7.3-58.1 scales_1.2.1 tidygraph_1.2.3
## [58] hms_1.1.3 KEGGgraph_1.58.3 parallel_4.2.2
## [61] RColorBrewer_1.1-3 curl_5.0.1 yaml_2.3.7
## [64] memoise_2.0.1 gridExtra_2.3 downloader_0.4
## [67] ggfun_0.1.1 HDO.db_0.99.1 yulab.utils_0.0.6
## [70] sass_0.4.6 stringi_1.7.12 RSQLite_2.3.1
## [73] highr_0.10 tidytree_0.4.2 BiocParallel_1.32.6
## [76] GenomeInfoDb_1.34.9 rlang_1.1.1 pkgconfig_2.0.3
## [79] bitops_1.0-7 evaluate_0.21 lattice_0.20-45
## [82] labeling_0.4.2 treeio_1.22.0 patchwork_1.1.2
## [85] shadowtext_0.1.2 cowplot_1.1.1 bit_4.0.5
## [88] tidyselect_1.2.0 plyr_1.8.8 magrittr_2.0.3
## [91] R6_2.5.1 generics_0.1.3 DBI_1.1.3
## [94] pillar_1.9.0 withr_2.5.0 KEGGREST_1.38.0
## [97] RCurl_1.98-1.12 crayon_1.5.2 utf8_1.2.3
## [100] tzdb_0.4.0 rmarkdown_2.23 viridis_0.6.3
## [103] grid_4.2.2 data.table_1.14.8 Rgraphviz_2.42.0
## [106] blob_1.2.4 digest_0.6.33 gridGraphics_0.5-1
## [109] munsell_0.5.0 viridisLite_0.4.2 ggplotify_0.1.1
## [112] bslib_0.5.0